Abstract

Learning joint probability distributions on n random variables requires exponential 
sample size in the generic case. Here we consider the case that a temporal (or causal) 
order of the variables is known and that the (unknown) graph of causal dependencies 
has bounded in-degree A. Then the joint measure is uniquely determined by the 
probabilities of all (2A + 1)-tuples. Upper bounds on the sample size required for
estimating their probabilities can be given in terms of the VC-dimension of the set of 
corresponding cylinder sets. The sample size grows less than linearly with n. 

1 Introduction 

Learning joint probability measures on a large set of variables is an important task
of statistics. One of the main motivations for estimating joint probabilities is to study
statistical dependencies and independencies between the random variables. In many
applications the goal is to obtain information on the underlying causal structure that
produces the statistical correlations. However, learning causal structure
from statistical data is in general a deep problem and cannot be solved by statistical
considerations alone [?, ?].

Here we do not focus on the problem of uncovering the causal structure; rather, we
address the problem of learning the probability distribution on a large set of variables.
In general, the sample size required for estimating an unknown measure on the variables
$X_1, \dots, X_n$ grows exponentially with $n$. Assume for simplicity that each $X_j$ is a
discrete variable with $d$ possible values. Then the probabilities of $d^n$ possible outcomes
have to be estimated. The sample size can be decreased considerably if prior knowledge on the
possible correlations is given. Consider for example the trivial case when no statistical
dependencies are possible at all, i.e.,

$$P(x_1, x_2, \dots, x_n) = P(x_1)\,P(x_2) \cdots P(x_n),$$

where $x_j$ denotes a particular realization of the corresponding variable $X_j$. Then one
only has to learn the probabilities $P(x_1), \dots, P(x_n)$.
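
As a rough illustration of this gap, the following sketch counts the free parameters that have to be estimated with and without the independence assumption. The function names and the example values n = 20, d = 2 are assumptions made for illustration only.

```python
def full_joint_parameters(n: int, d: int) -> int:
    """Free parameters of an unrestricted joint distribution:
    d**n outcome probabilities minus one for normalization."""
    return d ** n - 1


def independent_parameters(n: int, d: int) -> int:
    """Free parameters if all n variables are independent:
    (d - 1) probabilities per marginal distribution."""
    return n * (d - 1)


if __name__ == "__main__":
    n, d = 20, 2  # hypothetical example values
    print("unrestricted joint:", full_joint_parameters(n, d))
    print("fully independent: ", independent_parameters(n, d))
```

The first count grows exponentially in n, the second only linearly.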

There are less trivial examples where prior information on the statistical dependencies
strongly reduces the required sample size. For instance, this information may stem
from knowledge of the underlying causal structure. Following [?] we encode causal
structure in a directed graph with random variables as its nodes. Here we assume the
graph to be acyclic. The decisive prior information assumed to be given here is that
each variable has at most $A$ parents, i.e., is influenced directly by at most $A$ other
nodes. Note that we do not assume that we know which nodes are the parents. Therefore,
our assumption is merely a kind of simplicity assumption on the causation behind the
statistical dependencies. Furthermore, it should be emphasized that in many cases one
will not find any pair of variables that are statistically independent. The constraints
that the causal structure imposes on the joint probability measure are more sophisticated
and are only reflected in conditional probabilities. These constraints are well known as
the Markov condition in Bayesian networks [?, ?]. Conversely, Bayesian networks may
be considered as a convenient and intuitive way of encoding statistical dependencies
among variables in a graph (without any causal interpretation).

2 Bayesian networks 

Let us briefly introduce Bayesian networks. To do that we define conditional inde- 
pendence relationships among variables, a central notion in the analysis of probability 
distributions. 

Definition 1 (Conditional independence) 

Let $V = \{X_1, X_2, \dots, X_n\}$ be a finite set of variables. Let $P(\cdot)$ be a joint probability
distribution over the variables in $V$, and let $X$, $Y$ and $Z$ stand for any three subsets
of $V$. The sets $X$ and $Y$ are said to be conditionally independent given $Z$, denoted by

$$(X \perp Y \mid Z) \tag{1}$$

if

$$P(x, y \mid z) = P(x \mid z)\,P(y \mid z), \quad \text{whenever } P(z) > 0, \tag{2}$$

where $x$ is the tuple denoting a particular realization of the values of the variables in
$X$ and the tuples $y$ and $z$ are defined analogously. In words, if all the actual values of
the variables in $Z$ are known, the actual values of the variables in $Y$ do not provide any
further information on the actual values of the variables in $X$.
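
A minimal sketch of how Definition 1 can be checked numerically when the joint distribution is fully known. Representing the distribution as a dictionary from outcome tuples to probabilities, the function names and the tolerance are assumptions made for illustration. The test uses the product form P(x, y, z) P(z) = P(x, z) P(y, z), which is equivalent to Eq. (2) whenever P(z) > 0.

```python
from collections import defaultdict


def marginal(joint, idx):
    """Marginalize a joint table {outcome tuple: probability} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def cond_indep(joint, X, Y, Z, tol=1e-9):
    """Check (X independent of Y given Z) for index lists X, Y, Z.

    Compares P(x, y, z) * P(z) with P(x, z) * P(y, z) for all value
    combinations that have positive probability on both (x, z) and (y, z).
    """
    X, Y, Z = list(X), list(Y), list(Z)
    pxyz = marginal(joint, X + Y + Z)
    pxz = marginal(joint, X + Z)
    pyz = marginal(joint, Y + Z)
    pz = marginal(joint, Z)
    for xz, p_xz in pxz.items():
        x, z = xz[:len(X)], xz[len(X):]
        for yz, p_yz in pyz.items():
            y, z2 = yz[:len(Y)], yz[len(Y):]
            if z2 != z:
                continue
            if abs(pxyz.get(x + y + z, 0.0) * pz[z] - p_xz * p_yz) > tol:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical binary chain X0 -> X1 -> X2: each variable copies its
    # predecessor with probability 0.9.
    t = lambda a, b: 0.9 if a == b else 0.1
    joint = {(a, b, c): 0.5 * t(a, b) * t(b, c)
             for a in (0, 1) for b in (0, 1) for c in (0, 1)}
    print(cond_indep(joint, [0], [2], [1]))  # X0 and X2 given X1
    print(cond_indep(joint, [0], [2], []))   # X0 and X2 without conditioning
```

In this made-up chain, X0 and X2 are dependent, but become independent once X1 is known.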

Directed acyclic graphs, or Bayesian networks (a term coined in [?]), are used to
facilitate economical representation of joint probability distributions. The basic
decomposition scheme offered by directed acyclic graphs can be illustrated as follows.
Let $P(\cdot)$ be a joint probability distribution as in Definition 1. The chain rule of
probability calculus always permits decomposing $P$ as a product of $n$ conditional probability
distributions:

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid x_1, \dots, x_{j-1}). \tag{3}$$

Now suppose that the conditional probability of some variable $X_j$ is not sensitive to
all the predecessors of $X_j$ but only to a small subset of those predecessors. In words,
suppose that $X_j$ is independent of all other predecessors, once we know the values of
a selected group of predecessors called $P_j := \{X_{j_1}, \dots, X_{j_{m_j}}\}$. We can then write

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid p_j), \tag{4}$$

considerably simplifying the input information. Instead of specifying the probability
of $X_j$ conditional on all possible realizations of its predecessors $X_1, \dots, X_{j-1}$, we need
only take into account the possible realizations of the set $P_j$. The set $P_j$ is called the
Markovian parents of $X_j$, or the parents for short. The reason for the name becomes
clear when we introduce graphs around this concept.
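
The factorization in Eq. (4) can be sketched as follows. The data layout (a list of parent index tuples and a list of conditional probability tables), the function name and the three-variable chain with its made-up numbers are assumptions for illustration, not part of the original text.

```python
from itertools import product


def joint_from_parents(parents, cpts, values=(0, 1)):
    """Build the joint table of Eq. (4) as {outcome tuple: probability}.

    parents[j] : tuple with the indices of the parents of variable j (all < j)
    cpts[j]    : dict mapping a tuple of parent values to a dict
                 {value of variable j: conditional probability}
    """
    n = len(parents)
    joint = {}
    for outcome in product(values, repeat=n):
        p = 1.0
        for j in range(n):
            parent_values = tuple(outcome[i] for i in parents[j])
            p *= cpts[j][parent_values][outcome[j]]
        joint[outcome] = p
    return joint


if __name__ == "__main__":
    # Hypothetical chain X0 -> X1 -> X2 with binary variables.
    copy = {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.1, 1: 0.9}}
    parents = [(), (0,), (1,)]
    cpts = [{(): {0: 0.5, 1: 0.5}}, copy, copy]
    joint = joint_from_parents(parents, cpts)
    print(sum(joint.values()))  # total probability of all outcomes
```

Only the small conditional tables have to be specified; the full joint table with one entry per outcome is derived from them.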

Definition 2 (Markov parents) 

Let $V = \{X_1, \dots, X_n\}$ be an ordered set of variables, and let $P(\cdot)$ be the joint probability
distribution on these variables. A set of variables $P_j$ is said to be Markovian parents
of $X_j$ if $P_j$ is a minimal set of predecessors of $X_j$ that renders $X_j$ independent of all
its other predecessors. In words, $P_j$ is any subset of $\{X_1, \dots, X_{j-1}\}$ satisfying

$$P(x_j \mid p_j) = P(x_j \mid x_1, \dots, x_{j-1}) \tag{5}$$

such that no proper subset of $P_j$ satisfies Eq. (5).

This definition assigns to each variable $X_j$ a selected set $P_j$ of preceding variables
that are sufficient for determining the probability of $X_j$. The values of the other
preceding variables are redundant once we know the values $p_j$ of the parent set $P_j$.
This assignment can be encoded in a directed acyclic graph in which the variables are
represented by the nodes and arrows are drawn from each node of the parent set toward
the child node $X_j$.

Furthermore, Definition 2 also provides a simple recursive method for constructing
such a DAG: Starting with the pair $(X_1, X_2)$, we draw an arrow from $X_1$ to $X_2$ if and
only if the two variables are dependent. Assume that we have constructed the DAG
up to node $j - 1$. At the $j$th stage, we select any minimal set of predecessors of $X_j$
that renders $X_j$ independent of its other predecessors (as in Eq. (5)), call this set
$P_j$ and draw an arrow from each member of $P_j$ to $X_j$. The result is a directed acyclic
graph, called a Bayesian network, in which an arrow from $X_i$ to $X_j$ assigns $X_i$ as a
Markovian parent of $X_j$, consistent with Definition 2.

Let us mention that the set $P_j$ is unique whenever the distribution $P(\cdot)$ is strictly
positive, i.e. every configuration of variables, no matter how unlikely, has some nonzero
probability of occurring. Under such conditions, the Bayesian network associated with
$P(\cdot)$ is unique, given the ordering of the variables.

Definition 3 (Markov Compatibility) 

Let $G$ be a DAG. If a probability distribution $P$ admits a factorization relative to $G$,
i.e.

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(X_j = x_j \mid P_j = p_j), \tag{6}$$

where $P_j$ are the parents of the node $X_j$ defined by the graph $G$, then we say $G$ and $P$
are compatible, or that $P$ is Markov relative to $G$.

The problem of learning a Bayesian network usually treated in the literature is
as follows. Given a training set $\{x^1, \dots, x^l\}$, find a network that best matches the
training set (see e.g. [?, 2]), i.e. determine a graph $G$ such that $P$ is Markov relative
to $G$.

3 Networks with bounded in-degree 

To motivate our decisive assumption we would like to note that scientific reasoning
always tries to find a simple explanation for the data ("Occam's Razor"). We are
aware of the fact that "simplicity" is hard to formalize. However, it seems reasonable
to try to explain data by simple causal graphs. Here we may use the in-degree of the
graph as a criterion for simplicity. It is defined as the greatest number of parents that
occurs for any node. The intuitive meaning of in-degree $A$ is that no variable is directly
influenced by more than $A$ others. For $A \ll n$ we call the graph sparse. Clearly, the in-degree
is only one of the graph-theoretical notions that may be used to define simplicity of
causal explanations; we could also use, e.g., the number of edges.

Let $G$ be an arbitrary DAG with in-degree $A$. Then every probability measure that
is Markovian relative to $G$ is already determined by the probabilities of all $(A + 1)$-tuples.
This follows directly from the decomposition in Eq. (5) since the conditional
probabilities $P(x_j \mid p_j)$ are the quotients of the probabilities $P(x_j, p_j)$ and $P(p_j)$ of
tuples of sizes at most $A + 1$ and $A$, respectively. Consequently, if $G$ is known we can learn the
probability measure $P$ by learning the probabilities of all $(A + 1)$-tuples.

In contrast, we do not assume that we know the exact structure of $G$ but only that
its in-degree is at most $A$. Now the situation is more complicated. Since we do not know
the set of parents for any $X_j$, we do not know which conditional probabilities have
to appear in the factorization in Eq. (5). Therefore, it is not sufficient to know the
probabilities of all tuples of size $A + 1$ to reconstruct the structure. We have to know the
probabilities of at least all $(A + 2)$-tuples to be able to test conditional independencies.
The following theorem shows that it is sufficient to know the probabilities of all
$(2A + 1)$-tuples.

Theorem 4 (Graph structure from correlations) 

Let $X_1 < X_2 < \dots < X_n$ be an ordering of the variables. Assume that $P$ is a
probability measure that is Markov relative to a directed acyclic graph (DAG) $G$. Let $G$
be consistent with the ordering, i.e., the graph $G$ contains no arrow from $X_j$ to $X_i$ for
$i < j$. Let $G$ have in-degree $A$ and assume that the probabilities of all $(2A + 1)$-tuples
are known. Then we can find a graph $G'$ (possibly different from $G$) such that $P$ is Markov
relative to $G'$ and $G'$ has in-degree at most $A$.

Proof: We can find the correct graph structure by the following iteration: Draw an
arrow from $X_1$ to $X_2$ if the two variables are dependent. Assume we have found the
correct structure on $X_1, X_2, \dots, X_{j-1}$.

In order to find a possible minimal set $P_j$ of parents of $X_j$ we proceed as follows: Let
$m := \min\{j - 1, A\}$. For each $m$-subset $K \subset V_j := \{X_1, X_2, \dots, X_{j-1}\}$ test whether
the following statement is true:

$(X_j \perp L \mid K)$ for all sets $L$ (disjoint from $K$) that contain at most $m$ elements.

If this is true, $K$ necessarily contains a set $P_j$ that can be taken as Markovian
parents of $X_j$. This can be seen as follows: Choose $L$ such that $(L \cup K) \supseteq P_j$ for
an arbitrary minimal choice of parents of $X_j$. This is possible since $X_j$ has at most
$m$ parents. Since $L \cup K$ contains the parents of $X_j$ it renders $X_j$ independent of its
predecessors (see the d-separation criterion in [?]). Formally we have $(X_j \perp V_j \mid L \cup K)$.
By the contraction rule for conditional independencies (see [?]) the statements
$(X_j \perp V_j \mid L \cup K)$ and $(X_j \perp L \mid K)$ imply $(X_j \perp V_j \mid K)$. Hence $K$ must contain a set
$P_j$ that can be viewed as Markovian parents of $X_j$.

Now we can test whether a proper subset $K'$ of $K$ satisfies $(X_j \perp L \mid K')$ and obtain
a minimal set of parents of $X_j$ by iterating this procedure. □
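
The test used in the proof can be sketched in a few lines, assuming the exact joint distribution is available as a table (in practice only estimated tuple probabilities would be at hand). The sketch is a variant, not the procedure of the proof verbatim: instead of first testing all m-subsets K and then shrinking them, it tries candidate sets K in order of increasing size, so the first set that passes has smallest cardinality. All function names, the table layout and the tolerance are assumptions for illustration. Each test involves at most 2A + 1 variables.

```python
from collections import defaultdict
from itertools import combinations


def marginal(joint, idx):
    """Marginalize a joint table {outcome tuple: probability} onto the positions in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out


def cond_indep(joint, X, Y, Z, tol=1e-9):
    """Check (X independent of Y given Z) via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    X, Y, Z = list(X), list(Y), list(Z)
    pxyz, pz = marginal(joint, X + Y + Z), marginal(joint, Z)
    pxz, pyz = marginal(joint, X + Z), marginal(joint, Y + Z)
    for xz, p_xz in pxz.items():
        x, z = xz[:len(X)], xz[len(X):]
        for yz, p_yz in pyz.items():
            y, z2 = yz[:len(Y)], yz[len(Y):]
            if z2 == z and abs(pxyz.get(x + y + z, 0.0) * pz[z] - p_xz * p_yz) > tol:
                return False
    return True


def find_parents(joint, j, max_in_degree):
    """Smallest set K of predecessors of variable j such that variable j is
    independent of every disjoint set L of at most m predecessors, given K."""
    preds = list(range(j))
    m = min(j, max_in_degree)
    for size in range(m + 1):
        for K in combinations(preds, size):
            rest = [i for i in preds if i not in K]
            if all(cond_indep(joint, [j], L, K)
                   for r in range(1, m + 1)
                   for L in combinations(rest, r)):
                return K
    return None  # no candidate set of size <= max_in_degree passed the test


def learn_structure(joint, n, max_in_degree):
    """Parent set for every variable, following the known order 0, 1, ..., n - 1."""
    return [find_parents(joint, j, max_in_degree) for j in range(n)]
```

For a distribution generated by a chain X0 -> X1 -> X2 -> X3 and in-degree bound 1, this sketch should return the immediate predecessor as the only parent of each variable after the first.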

4 Learning the probabilities of k-tuples

Now we shall present an upper bound on the sample size required to learn the
probabilities of all $k$-tuples with good reliability. Then we can apply this result to the
case $k := 2A + 1$.

Let $P(\cdot)$ be a probability distribution over an (ordered) set of random variables
$V = \{X_1, \dots, X_n\}$, where $X_j$ takes on values in $\Omega_j$ for $j = 1, \dots, n$.
Let $X_{j_1}, \dots, X_{j_k}$ be any $k$-subset of $V$. We would like to have a reliable statement
on the probability of the event $(x_{j_1}, \dots, x_{j_k}) \in \Omega_{j_1} \times \cdots \times \Omega_{j_k}$, i.e. the probability

$$P(X_{j_1} = x_{j_1}, \dots, X_{j_k} = x_{j_k}).$$

The problem of determining the sample size required for reliably estimating the
probability of one specific event is a standard problem of statistics. However, the problem we
encounter in learning Bayesian networks is more sophisticated: we have to be almost
sure that the estimated probabilities of all $(2A + 1)$-tuples are sufficiently close to the
real (unknown) probabilities.

The problem of determining whether and how fast the relative frequencies of a
large set of events converge uniformly to their probabilities is well known in statistical
learning theory [8]. Statements on uniform convergence rely on the so-called
Vapnik-Chervonenkis dimension (VC-dimension) of the considered set of events.

Definition 5 (VC dimension) 

Let $P$ be an unknown probability measure on a probability space $\Omega$ and $S$ a set of events,
i.e., a set of measurable subsets of $\Omega$. Define the VC-dimension of $S := \{M_\lambda\}$ as the
largest number $h$ such that there exist $h$ points $\omega_1, \omega_2, \dots, \omega_h \in \Omega$ such that the sets
$M_\lambda \cap \{\omega_1, \dots, \omega_h\}$ run over all $2^h$ subsets of $\{\omega_1, \dots, \omega_h\}$. Intuitively, one can consider
the sets $M_\lambda$ as classifiers and the VC-dimension as the largest number of points that
can be classified in all $2^h$ possible ways. The VC-dimension is said to be infinite if such
an $h$-subset can be found for all $h \in \mathbb{N}$.

A trivial upper bound on the VC-dimension is given by the logarithm to base 2 of 
the number of events (in the case that S is finite). 
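
For small finite examples, Definition 5 and the trivial logarithmic bound can be checked by brute force. The following sketch is only meant to make the definition concrete; the function names and the toy family of events (all intervals of integers within {1, 2, 3, 4}) are assumptions for illustration.

```python
from itertools import combinations
from math import log2


def shatters(events, points):
    """True if the traces of the events on the given points
    run over all 2**h subsets of the h points."""
    pts = frozenset(points)
    traces = {frozenset(M) & pts for M in events}
    return len(traces) == 2 ** len(pts)


def vc_dimension(events, ground_set):
    """Largest h such that some h-subset of ground_set is shattered.

    The search can stop at the first size for which no subset is shattered,
    because every subset of a shattered set is itself shattered.
    """
    h = 0
    for size in range(1, len(ground_set) + 1):
        if any(shatters(events, pts) for pts in combinations(ground_set, size)):
            h = size
        else:
            break
    return h


if __name__ == "__main__":
    ground = [1, 2, 3, 4]
    # Toy family: all intervals {a, a + 1, ..., b} with 1 <= a <= b <= 4.
    intervals = [set(range(a, b + 1)) for a in ground for b in ground if a <= b]
    print(vc_dimension(intervals, ground), log2(len(intervals)))
```

The brute-force search is exponential in the size of the ground set and is only intended for toy cases like this one.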

Finite VC-dimension is known to be sufficient and necessary in order to have uni- 
form convergence of relative frequencies to their probabilities. Quantitatively, one has 
the following theorem: 

Theorem 6 (Uniform convergence) 

Let $f(M)$ be the relative frequency of occurrences of $M$ after $l$ runs.
Let $S$ have VC-dimension $h$. Let $R_\epsilon$ be the risk (probability) that $S$ contains at least
one set $M$ such that $|f(M) - P(M)| > \epsilon$ for an arbitrary positive $\epsilon$. Then we have

$$R_\epsilon < 4 \exp\left\{ \left( \frac{h\,(1 + \ln(2l/h))}{l} - \left(\epsilon - \frac{1}{l}\right)^2 \right) l \right\}. \tag{7}$$

Proof: see Theorem 4.4 in [8]. □

This theorem allows us to derive a sample size that is sufficient
to estimate the probabilities of all $k$-tuples. First we have to define the set of events and
give an upper bound on its VC-dimension.

Let $\Omega := \Omega_1 \times \cdots \times \Omega_n$ be the probability space. This means that the $j$th random
variable takes on values from $\Omega_j$ for $j = 1, \dots, n$. The $k$-tuples are characterized
by the positions and values the corresponding random variables take on. Let
$\mathbf{j} := \{j_1, j_2, \dots, j_k\}$ be an arbitrary $k$-subset of $\{1, \dots, n\}$ and $x \in \Omega_{j_1} \times \cdots \times \Omega_{j_k}$.

We then denote by $M_x$ the event that the random variables $X_{j_1}, X_{j_2}, \dots, X_{j_k}$ take
on the values $x_{j_1}, \dots, x_{j_k}$. This event corresponds uniquely to a cylinder set $C_x \subset \Omega$.

An upper bound on the VC-dimension of the set of those events that correspond to
cylinder sets $C_x$ is easy to get. Let $d$ be the maximal cardinality of the sets $\Omega_j$. Then,
for fixed $k$, there exist at most

$$d^k \binom{n}{k} \tag{8}$$

such cylinder sets. The first term gives an upper bound on the possible combinations of
values and the second term the number of different positions. This number is smaller
than $(nd)^k$. By taking the logarithm to base 2 we obtain an upper bound on the
VC-dimension:

$$h < k \log_2(nd). \tag{9}$$

Obviously, we can use much better bounds for concrete applications, e.g. given by Stir- 
ling's approximation (giving a less intuitive expression but providing a tighter bound). 
However, this crude upper bound is sufficient to study the asymptotic behavior. 
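
To get a feeling for how crude the bound is, the following sketch compares the logarithm of the exact count of cylinder sets with k log2(nd). The function names and the example values n = 100, k = 3, d = 2 are assumptions for illustration.

```python
from math import comb, log2


def num_cylinder_sets(n: int, k: int, d: int) -> int:
    """Upper bound on the number of cylinder sets for fixed k:
    d**k value combinations times (n choose k) positions."""
    return d ** k * comb(n, k)


def vc_bound_from_count(n: int, k: int, d: int) -> float:
    """Bound on the VC-dimension: log2 of the number of events."""
    return log2(num_cylinder_sets(n, k, d))


def vc_bound_crude(n: int, k: int, d: int) -> float:
    """The simpler bound k * log2(n * d)."""
    return k * log2(n * d)


if __name__ == "__main__":
    n, k, d = 100, 3, 2  # hypothetical example values
    print(num_cylinder_sets(n, k, d))
    print(vc_bound_from_count(n, k, d), vc_bound_crude(n, k, d))
```

Both bounds grow only logarithmically in n for fixed k and d.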

Now we will present a lower bound on the VC-dimension in order to get an idea
of how tight the upper bound in (9) is.

We construct $l$ $n$-tuples with $l := \lfloor \log_2(n - k + 1) \rfloor$ as follows. For each set $\Omega_j$ we
choose two different values $x_{j,0}$ and $x_{j,1}$ for $j = 1, \dots, n$. This defines a map $\varphi$ from
the set of binary words of length $n$ into $\Omega$ by setting

$$\varphi : b_1 b_2 \dots b_n \mapsto (x_{1,b_1}, x_{2,b_2}, \dots, x_{n,b_n}). \tag{10}$$

Now we define an $l \times n$ matrix $M$ with entries 0 and 1 as follows: The first $k - 1$
columns have only 1 as entries. The next $2^l$ columns are the binary words of length $l$.
The remaining $(n - k + 1 - 2^l)$ columns can be chosen arbitrarily.

The rows of $M$ correspond to $n$-tuples by the map $\varphi$. Let $Y$ be the set of those
$n$-tuples and $S$ be an arbitrary subset of $Y$. $S$ can uniquely be characterized by a
vector $s$ of length $l$ with entries 0 and 1 where the $j$-th entry of $s$ indicates whether the
$j$-th $n$-tuple is an element of $S$ or not. The matrix $M$ contains a column that coincides
with $s$. Assume it to be the $i$-th column. Then $C_x \cap Y$ contains exactly those $n$-tuples
that are elements of $S$ provided that $C_x$ is chosen as follows. Let $\mathbf{j}$ be $(1, 2, \dots, k - 1, i)$
and choose $x$ as the $k$-tuple $(x_{1,1}, x_{2,1}, \dots, x_{k-1,1}, x_{i,1})$. This shows that the cylinder
sets corresponding to $k$-tuples are able to classify $Y$ in all $2^l$ possible ways. Therefore
$\lfloor \log_2(n - k + 1) \rfloor$ is a lower bound on the VC-dimension of the cylinder sets. Comparing
this bound with the upper bound in (9), we see that it gives the correct asymptotic
behavior in the O-notation if $k$ and $d$ are considered as constants.
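
The construction can be verified mechanically for small n and k. In the sketch below the two chosen values of each variable are identified with the bits 0 and 1, indices are zero-based, and the arbitrary remaining columns are filled with zeros; these conventions and the function names are assumptions for illustration.

```python
from itertools import product


def build_matrix(n, k):
    """Rows and columns of the l x n matrix M from the construction (requires n >= k)."""
    l = (n - k + 1).bit_length() - 1                # floor(log2(n - k + 1))
    words = list(product((0, 1), repeat=l))         # all 2**l binary words of length l
    columns = [(1,) * l] * (k - 1) + words          # k - 1 all-one columns, then the words
    columns += [(0,) * l] * (n - len(columns))      # remaining columns: arbitrary, here zeros
    rows = [tuple(col[r] for col in columns) for r in range(l)]
    return rows, columns


def shattered_by_cylinders(n, k):
    """Check that every subset of the l rows is cut out by some cylinder set
    that fixes the first k - 1 positions and one further position to the value 1."""
    rows, columns = build_matrix(n, k)
    l = len(rows)
    for s in product((0, 1), repeat=l):
        i = columns.index(s, k - 1)                 # a column beyond the first k - 1 equal to s
        positions = list(range(k - 1)) + [i]
        selected = tuple(int(all(row[p] == 1 for p in positions)) for row in rows)
        if selected != s:
            return False
    return True


if __name__ == "__main__":
    print(shattered_by_cylinders(10, 3))
```

For n = 10 and k = 3 the matrix has three rows, so the check runs over all eight subsets of the three constructed n-tuples.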

Theorem 7 For $\epsilon > 0$ let $R_\epsilon$ be the risk that there is a cylinder set $C_x$ such that its
relative frequency deviates from its probability by more than $\epsilon$. Then $R_\epsilon$ can be made
smaller than any $\delta > 0$ while only increasing the sample size linearly with $n$.



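Proof: We choose $l$ such that

$$\frac{l}{1 + \ln(2l)} \cdot \frac{(\epsilon - 1/l)^2}{2} > k \log_2(nd).$$

This can asymptotically be achieved by increasing $l$ with $O(n)$: the right-hand side grows
only logarithmically in $n$, whereas the left-hand side grows like $l / \ln(l)$, i.e. only
slightly less than linearly in $l$.
Using our bound

$$h < k \log_2(nd)$$

we obtain

$$h < \frac{l}{1 + \ln(2l)} \cdot \frac{(\epsilon - 1/l)^2}{2}$$

and get

$$\frac{h\,(1 + \ln(2l))}{l} < \frac{(\epsilon - 1/l)^2}{2}.$$

By elementary calculation, using $\ln(2l/h) \le \ln(2l)$ for $h \ge 1$, this implies

$$\frac{h\,(1 + \ln(2l/h))}{l} - (\epsilon - 1/l)^2 < -\frac{(\epsilon - 1/l)^2}{2}.$$

Using the bound of Theorem 6 this shows that the risk $R_\epsilon$ can even be made to decrease
exponentially in $n$ while increasing the sample size $l$ only linearly in $n$. □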
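
The condition on the sample size at the start of the proof, read here as l/(1 + ln(2l)) * (eps - 1/l)^2 / 2 > k log2(nd), can be explored numerically. The sketch below searches for the smallest l that satisfies it; the function names and the example values are assumptions for illustration, and the bisection relies on the left-hand side being increasing in l for l > 1/eps.

```python
from math import ceil, log, log2


def condition_holds(l, n, k, d, eps):
    """Sample-size condition from the proof of Theorem 7 (as read above)."""
    if l * eps <= 1:                     # the argument needs eps - 1/l > 0
        return False
    lhs = l / (1 + log(2 * l)) * (eps - 1 / l) ** 2 / 2
    return lhs > k * log2(n * d)


def sufficient_sample_size(n, k, d, eps):
    """Smallest l that satisfies the condition, found by doubling and bisection."""
    hi = max(2, ceil(1 / eps) + 1)
    while not condition_holds(hi, n, k, d, eps):
        hi *= 2
    lo = hi // 2                         # the condition is violated at lo
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if condition_holds(mid, n, k, d, eps):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical example: k = 3 (in-degree bound 1), binary variables, eps = 0.1.
    for n in (100, 1000, 10000):
        print(n, sufficient_sample_size(n, 3, 2, 0.1))
```

In such experiments the resulting l grows far more slowly than n, which is consistent with the statement that a linear increase of the sample size is more than enough.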

Note that the sample size has to be chosen such that the deviation of the relative
frequencies from their probabilities is small compared to the relative frequencies. Then
we have a reasonable criterion to decide for which sets $X$, $Y$, $Z$ of variables we may
assume $X$ and $Y$ to be independent given $Z$. This criterion is as follows: Based on
the error bound established above, we compute the relative uncertainty of the conditional
probabilities used in the algorithm in the proof of Theorem 4. If the observed statistical
dependencies are greater than this uncertainty, we assume the variables to be dependent.

5 Conclusions

The sample size needed to learn the joint probability distribution on $n$ nodes increases only
linearly with $n$ if the underlying causal structure is assumed to be sufficiently simple.
Here we considered the case that the (unknown) causal graph is known to have
in-degree at most $A$ and that a known time order exists. Then a graph that is Markov relative
to the unknown probability measure can be found efficiently if only the probabilities
of all $(2A + 1)$-tuples are known. They can be learned with linear sample size. We
have shown this by finding bounds on the VC-dimension of the corresponding cylinder
sets. We would like to note that the causal structure can at least be guessed if only
the probabilities of $(2A + 1)$-tuples are known, since they allow testing a large number
of statistical independencies.